Daisyden/artifacts #1630

Open: wants to merge 14 commits into base: main

Conversation

daisyden
Contributor

For ai_for_validation testing.

Contributor Author

daisyden commented May 6, 2025

xpu-ops bot UT triage result for your reference:

| class | case | result | error message | triage result |
| --- | --- | --- | --- | --- |
| test.distributed._composable.fsdp.test_fully_shard_autograd.TestFullyShardAutograd | test_nontensor_activations | failed | RuntimeError: Process 0 exited with error code 10 and exception: ; RuntimeError: oneCCL: coll_param.cpp:455 validate: EXCEPTION: average operation is not supported for the scheduler path ; RuntimeError: oneCCL: coll_param.cpp:455 validate: EXCEPTION: average operation is not supported for the scheduler path | "N/A" |

Contributor Author

daisyden commented May 6, 2025

xpu-ops bot UT triage result for your reference:

| class | case | result | error message | triage result |
| --- | --- | --- | --- | --- |
| test.distributed._composable.fsdp.test_fully_shard_compile.TestFullyShardCompileCompute | test_disable_compiling_hooks | failed | RuntimeError: Process 0 terminated or timed out after 300.0208730697632 seconds | {'similar_issue_id': 'N/A', 'similar_issue_state': 'N/A', 'issue_owner': 'distributed_computing_team', 'issue_description': 'The unit test test_disable_compiling_hooks in test_fully_shard_compile.py is failing due to a timeout error. The test involves disabling compiling hooks in a distributed computing context, and the process is terminating after approximately 300 seconds, indicating a potential deadlock or resource management issue.', 'root_causes': [{'cause': 'Process communication failure in distributed environment', 'description': 'The test may be hanging due to improper communication between processes, leading to a timeout.'}, {'cause': 'Resource management issue', 'description': 'The test might not be handling resources correctly, causing a deadlock or infinite loop.'}, {'cause': 'Regression in code', 'description': 'A recent code change may have introduced a bug affecting the behavior of the compiling hooks when disabled.'}], 'suggested_solutions': [{'solution': 'Improve process management and error handling', 'description': 'Enhance the test to handle subprocesses better, including proper cleanup and error handling to prevent hangs.'}, {'solution': 'Increase logging and debugging', 'description': 'Add logging to identify where the test is getting stuck, aiding in pinpointing the exact issue.'}, {'solution': 'Review recent code changes', 'description': 'Examine recent modifications related to FSDP or compilation hooks that might have introduced the bug.'}]} |
| test.distributed._composable.fsdp.test_fully_shard_autograd.TestFullyShardAutograd | test_nontensor_activations | failed | RuntimeError: Process 0 exited with error code 10 and exception: ; RuntimeError: oneCCL: coll_param.cpp:455 validate: EXCEPTION: average operation is not supported for the scheduler path ; RuntimeError: oneCCL: coll_param.cpp:455 validate: EXCEPTION: average operation is not supported for the scheduler path | {'similar_issue_id': 'N/A', 'similar_issue_state': 'N/A', 'issue_owner': 'Distributed Training Team, oneCCL Maintainers', 'issue_description': 'The unit test test.distributed._composable.fsdp.test_fully_shard_autograd.TestFullyShardAutograd/test_nontensor_activations is failing due to a RuntimeError related to oneCCL not supporting the average operation in the scheduler path. This occurs during distributed training with FSDP, likely when attempting to average gradients across processes.', 'root_causes': [{'cause': "Unsupported average operation in oneCCL's scheduler path", 'description': "The test triggers an average operation that oneCCL doesn't support in the scheduler path, causing the failure."}, {'cause': 'Version mismatch or missing support in oneCCL', 'description': 'The version of oneCCL used may not support the required operation, indicating a need for an update or alternative method.'}], 'suggested_solutions': [{'solution': 'Update oneCCL to a compatible version', 'description': 'Upgrade oneCCL to a version that supports the average operation in the scheduler path.'}, {'solution': 'Modify the test to avoid unsupported operations', 'description': "Adjust the test to use alternative aggregation methods if the average operation isn't supported."}, {'solution': 'Investigate and apply patches', 'description': 'If the issue is known, apply existing patches or workarounds to resolve the error.'}]} |

Contributor Author

daisyden commented May 6, 2025

xpu-ops bot UT triage result for your reference:

| class | case | result | error message | triage result |
| --- | --- | --- | --- | --- |
| test.distributed._composable.fsdp.test_fully_shard_compile.TestFullyShardCompileCompute | test_disable_compiling_hooks | failed | RuntimeError: Process 0 terminated or timed out after 300.0208730697632 seconds | {'similar_issue_id': 'N/A', 'similar_issue_state': 'N/A', 'issue_owner': 'Infrastructure Team', 'issue_description': 'The unit test `test_disable_compiling_hooks` in `test_fully_shard_compile.py` is failing due to a timeout error. The test involves disabling compiling hooks in a distributed computing setup using FSDP. The error indicates that process 0 terminated or timed out after approximately 300 seconds, suggesting a potential issue with process management or synchronization within the test.', 'root_causes': [{'cause': 'Improper process management leading to deadlocks or indefinite waiting.', 'explanation': 'The test may not be correctly starting or joining all processes, causing some to hang or wait indefinitely.'}, {'cause': 'Resource availability issues affecting process performance.', 'explanation': 'Insufficient CPU or memory resources might cause processes to hang, though this is less likely in a controlled unit test environment.'}, {'cause': 'Deadlock in distributed process communication.', 'explanation': 'Processes may be waiting for each other without proper synchronization, leading to a deadlock.'}], 'suggested_solutions': [{'solution': 'Review and adjust process management in the test.', 'details': 'Ensure all processes are properly started and joined to prevent deadlocks and timeouts.'}, {'solution': 'Investigate resource usage during the test.', 'details': 'Monitor CPU and memory usage to rule out environmental issues affecting process performance.'}, {'solution': 'Debug the test with logging or a debugger.', 'details': 'Identify the exact point where the test hangs to pinpoint the cause of the timeout.'}]} |
| test.distributed._composable.fsdp.test_fully_shard_autograd.TestFullyShardAutograd | test_nontensor_activations | failed | RuntimeError: Process 0 exited with error code 10 and exception: ; RuntimeError: oneCCL: coll_param.cpp:455 validate: EXCEPTION: average operation is not supported for the scheduler path ; RuntimeError: oneCCL: coll_param.cpp:455 validate: EXCEPTION: average operation is not supported for the scheduler path | {'similar_issue_id': 'N/A', 'similar_issue_state': 'N/A', 'issue_owner': 'PyTorch Distributed Training Team', 'issue_description': 'The unit test `test.distributed._composable.fsdp.test_fully_shard_autograd.TestFullyShardAutograd/test_nontensor_activations` is failing due to a `RuntimeError` related to oneCCL not supporting the average operation in the scheduler path. This occurs during the validation in `coll_param.cpp` at line 455, indicating an issue with distributed communication operations when using non-tensor activations with FSDP.', 'root_causes': ["The average operation is not supported in the scheduler path of oneCCL, which is used during the test's distributed computation.", "Potential issues in the implementation of FSDP's autograd handling with non-tensor activations leading to unsupported communication operations."], 'suggested_solutions': ['Investigate and modify the communication logic to either avoid the unsupported average operation or implement support for it in the scheduler path.', 'Review oneCCL documentation and PyTorch issue tracker for known issues or patches related to average operations in distributed training contexts.', 'Consider updating oneCCL or related libraries to ensure compatibility with the required operations.']} |
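
The triage above suggests finding the exact point where the test hangs. One low-cost way to do that is Python's standard `faulthandler` module, which can dump every thread's stack once a deadline passes. The sketch below is not taken from the test suite: `run_with_hang_dump` and its parameters are hypothetical names, and it assumes the deadline is set below the 300-second harness timeout so the dump is written before the process is killed.

```python
import faulthandler
import sys
import time


def run_with_hang_dump(fn, dump_after_s, dump_file=sys.stderr):
    """Run fn(); if it is still running after dump_after_s seconds,
    write the stack of every thread to dump_file (hypothetical helper)."""
    # exit=False: only report, do not kill the process when the deadline hits.
    faulthandler.dump_traceback_later(dump_after_s, exit=False, file=dump_file)
    try:
        return fn()
    finally:
        # Disarm the watchdog so a fast run produces no output.
        faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    # Stand-in for a slow or hanging test body; a real run might use a
    # deadline such as 240 seconds, chosen here purely for illustration.
    run_with_hang_dump(lambda: time.sleep(0.3), dump_after_s=0.1)
```

In a multi-process distributed test each rank would need its own watchdog, since `faulthandler` only sees the threads of the process it is armed in.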

Contributor Author

daisyden commented May 6, 2025

xpu-ops triage bot UT analysis result for your reference (only unique errors were analyzed):

test.distributed._composable.fsdp.test_fully_shard_compile.TestFullyShardCompileCompute.test_disable_compiling_hooks failed with RuntimeError: Process 0 terminated or timed out after 300.0208730697632 seconds; triage_bot result:

{'similar_issue_id': 'FB7281558', 'similar_issue_state': 'RESOLVED', 'issue_owner': 'distributed_computing_team', 'issue_description': 'The unit test test_disable_compiling_hooks in test_fully_shard_compile.py is timing out after 300 seconds, likely due to improper handling of compiling hooks in a distributed computing environment using FSDP.', 'root_causes': [{'cause': 'Timeout threshold is too low for the test to complete, especially when dealing with distributed operations and hooks management.', 'evidence': 'The test exceeds the 300-second threshold, indicating a potential timeout issue.'}, {'cause': 'Improper management of compiling hooks leading to deadlocks or blocking operations during the test execution.', 'evidence': 'The test involves disabling hooks, which may not be handled correctly, causing the process to hang.'}], 'suggested_solutions': [{'solution': 'Increase the timeout threshold for the test to accommodate longer-running operations.', 'rationale': 'Adjusting the timeout allows the test to complete without premature termination.'}, {'solution': 'Refactor the test to handle compiling hooks more efficiently, ensuring proper cleanup and resource management.', 'rationale': 'Correct hook management prevents deadlocks and ensures timely test completion.'}]}

test.distributed._composable.fsdp.test_fully_shard_autograd.TestFullyShardAutograd.test_nontensor_activations failed with RuntimeError: Process 0 exited with error code 10 and exception: ; RuntimeError: oneCCL: coll_param.cpp:455 validate: EXCEPTION: average operation is not supported for the scheduler path ; RuntimeError: oneCCL: coll_param.cpp:455 validate: EXCEPTION: average operation is not supported for the scheduler path; triage_bot result:

{'similar_issue_id': 'N/A', 'similar_issue_state': 'N/A', 'issue_owner': 'distributed_training_team', 'issue_description': 'The unit test test.distributed._composable.fsdp.test_fully_shard_autograd.TestFullyShardAutograd/test_nontensor_activations is failing due to a RuntimeError related to oneCCL not supporting the average operation in the scheduler path. This occurs during the validation step in coll_param.cpp:455, indicating an issue with distributed communication operations when using FSDP with non-tensor activations.', 'root_causes': [{'cause': 'The test attempts to perform an average operation that oneCCL does not support in the scheduler path, which is part of the distributed communication setup.', 'explanation': "The error arises because oneCCL's current implementation lacks support for the average operation in the scheduler path, which is crucial for certain distributed training scenarios."}, {'cause': "The interaction between FSDP's sharding mechanism and autograd's handling of non-tensor activations leads to an unsupported operation being called.", 'explanation': "FSDP's distributed sharding may require specific communication patterns that are not yet fully supported in oneCCL for certain operations, especially when non-tensor activations are involved in autograd computations."}], 'suggested_solutions': [{'solution': 'Modify the test to avoid using the average operation in the scheduler path or find an alternative method to achieve the required functionality without triggering the unsupported operation.', 'rationale': 'Adjusting the test to bypass the problematic operation can provide a temporary workaround while a more permanent fix is developed.'}, {'solution': "Collaborate with the oneCCL team to investigate and potentially add support for the average operation in the scheduler path, ensuring it aligns with FSDP's requirements.", 'rationale': "Enhancing oneCCL's capabilities would resolve the issue for all users encountering similar problems, 
providing a more robust solution."}, {'solution': "Review and update the distributed communication logic within FSDP to ensure compatibility with oneCCL's current capabilities, possibly by implementing fallback mechanisms for unsupported operations.", 'rationale': "This approach maintains FSDP's functionality while working within the constraints of the existing oneCCL implementation."}]}
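
The fallback idea in the last suggestion can be illustrated without any real communication library. The sketch below uses a toy backend that rejects an average reduction, as the oneCCL error does, and emulates the average with a sum followed by a division by the world size. `SumOnlyBackend` and `averaged_gradients` are hypothetical names invented for this example; nothing here reflects how FSDP or oneCCL is actually implemented. In `torch.distributed` terms the same idea corresponds to reducing with `ReduceOp.SUM` and dividing the result by the world size.

```python
from typing import List, Sequence


class SumOnlyBackend:
    """Toy stand-in for a collective backend that lacks an average op."""

    def __init__(self, per_rank_grads: Sequence[Sequence[float]]):
        # One gradient vector per simulated rank.
        self.per_rank_grads = per_rank_grads

    def all_reduce(self, op: str) -> List[float]:
        if op != "sum":
            raise RuntimeError(f"{op} operation is not supported")
        # Element-wise sum across all simulated ranks.
        return [sum(column) for column in zip(*self.per_rank_grads)]


def averaged_gradients(backend: SumOnlyBackend, world_size: int) -> List[float]:
    """Prefer a native average; fall back to sum / world_size if rejected."""
    try:
        return backend.all_reduce("avg")
    except RuntimeError:
        summed = backend.all_reduce("sum")
        return [value / world_size for value in summed]


if __name__ == "__main__":
    backend = SumOnlyBackend([[1.0, 2.0], [3.0, 6.0]])
    print(averaged_gradients(backend, world_size=2))
```

Catching `RuntimeError` broadly is a simplification; a real fallback would probably check backend capabilities up front instead of relying on the exception.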

Contributor Author

daisyden commented May 6, 2025

xpu-ops triage bot UT analysis result for your reference (only unique errors were analyzed):

  1. test.distributed._composable.fsdp.test_fully_shard_compile.TestFullyShardCompileCompute.test_disable_compiling_hooks failed with RuntimeError: Process 0 terminated or timed out after 300.0208730697632 seconds; triage_bot result:
{'similar_issue_id': 'N/A', 'similar_issue_state': 'N/A', 'issue_owner': 'distributed_training_team, fsdp_maintainers', 'issue_description': 'The unit test test_disable_compiling_hooks in test_fully_shard_compile.py is failing with a RuntimeError indicating that process 0 terminated or timed out after approximately 300 seconds. This suggests an issue with the test environment or setup when the compiling hooks are disabled, potentially related to resource management, process communication, or test assertions.', 'root_causes': [{'cause': 'The test may be experiencing issues with process communication or resource management when compiling hooks are disabled, leading to a timeout.', 'evidence': 'The test failure indicates a process timeout, which often points to issues in distributed environments where processes may hang or fail to terminate.'}, {'cause': 'The test setup or teardown may not be handling resources correctly when compiling hooks are disabled, causing the test to hang or consume excessive resources.', 'evidence': 'The test involves distributed computation, where resource management is critical. 
Improper handling can lead to deadlocks or indefinite waiting.'}, {'cause': "The test's assertions or expected behavior may not align with the actual behavior when compiling hooks are disabled, leading to an indefinite wait for a condition that never occurs.", 'evidence': "The test failure could indicate that the test is waiting for a condition that isn't met, causing it to time out."}], 'suggested_solutions': [{'solution': 'Investigate the test setup and ensure that all resources are properly initialized and released when compiling hooks are disabled.', 'rationale': 'Proper resource management is crucial in distributed tests to prevent hangs or timeouts.'}, {'solution': "Review the test's assertions and expected behavior to ensure they correctly handle the scenario where compiling hooks are disabled.", 'rationale': "Misaligned assertions can cause tests to wait indefinitely for conditions that aren't met."}, {'solution': 'Add additional logging or debugging to the test to identify why the process is terminating or hanging, which can provide more insight into the root cause.', 'rationale': 'More detailed logs can help pinpoint the exact issue leading to the timeout.'}]}
  2. test.distributed._composable.fsdp.test_fully_shard_autograd.TestFullyShardAutograd.test_nontensor_activations failed with RuntimeError: Process 0 exited with error code 10 and exception: ; RuntimeError: oneCCL: coll_param.cpp:455 validate: EXCEPTION: average operation is not supported for the scheduler path ; RuntimeError: oneCCL: coll_param.cpp:455 validate: EXCEPTION: average operation is not supported for the scheduler path; triage_bot result:
{'similar_issue_id': 'N/A', 'similar_issue_state': 'N/A', 'issue_owner': 'distrubuted_training_maintainers', 'issue_description': 'The unit test test.distributed._composable.fsdp.test_fully_shard_autograd.TestFullyShardAutograd/test_nontensor_activations is failing due to a RuntimeError related to oneCCL not supporting the average operation in the scheduler path. This indicates an issue with the communication library during distributed training with FSDP and non-tensor activations.', 'root_causes': ['The average operation is not supported in the scheduler path of oneCCL, which is used during distributed training with FSDP.', 'Potential issues in the coll_param.cpp file at line 455 where the average operation is being validated but not supported.'], 'suggested_solutions': ['Investigate and modify the communication logic to either support the average operation in the scheduler path or find an alternative method for gradient aggregation.', 'Review the coll_param.cpp file around line 455 to identify and correct any issues in the validation or implementation of the average operation.']}
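
The triage results in this thread are printed as Python dict literals, and their shape varies between runs: `root_causes` and `suggested_solutions` appear both as lists of plain strings and as lists of dicts keyed by `cause` or `solution`. A reader who wants to compare runs could normalize them with something like the sketch below. `summarize_triage` is a hypothetical helper, not part of the triage bot, and it assumes the result is available as a single string containing a valid literal.

```python
import ast
from typing import Any, Dict, List


def _flatten(items: List[Any], key: str) -> List[str]:
    """Accept either plain strings or dicts that carry the text under `key`."""
    return [item[key] if isinstance(item, dict) else item for item in items]


def summarize_triage(raw: str) -> Dict[str, Any]:
    """Parse a triage result printed as a Python dict literal (hypothetical helper)."""
    # literal_eval only accepts literals, so no code in the string is executed.
    result = ast.literal_eval(raw)
    return {
        "owner": result.get("issue_owner"),
        "similar_issue": result.get("similar_issue_id"),
        "causes": _flatten(result.get("root_causes", []), "cause"),
        "solutions": _flatten(result.get("suggested_solutions", []), "solution"),
    }


if __name__ == "__main__":
    sample = (
        "{'similar_issue_id': 'N/A', 'issue_owner': 'example_team', "
        "'root_causes': [{'cause': 'A', 'evidence': 'x'}, 'B'], "
        "'suggested_solutions': ['C']}"
    )
    print(summarize_triage(sample))
```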

2 participants